Abstract — There has recently been renewed interest in decoding Reed-Solomon
(RS) codes without using syndromes. In this
paper, we investigate the complexity of syndromeless decoding, 
and compare it to that of syndrome-based decoding. Aiming 
to provide guidelines to practical applications, our complexity 
analysis focuses on RS codes over characteristic-2 fields, for 
which some multiplicative FFT techniques are not applicable. 
Due to moderate block lengths of RS codes in practice, our 
analysis is complete, without big O notation. In addition to fast 
implementation using additive FFT techniques, we also consider 
direct implementation, which is still relevant for RS codes with 
moderate lengths. For high rate RS codes, when compared
to syndrome-based decoding algorithms, syndromeless
decoding algorithms not only require more field operations regardless of
implementation, but their direct implementations also lead to decoder
architectures with higher hardware costs and lower
throughput. We also derive tighter bounds on the complexities
of fast polynomial multiplications based on Cantor's approach 
and the fast extended Euclidean algorithm. 

Index Terms — Reed-Solomon codes, Decoding, Complexity 
theory, Galois fields, Discrete Fourier transforms, Polynomials 



I. Introduction 

Reed-Solomon (RS) codes are among the most widely used 
error control codes, with applications in space communica- 
tions, wireless communications, and consumer electronics [1]. 
As such, efficient decoding of RS codes is of great interest. 
The majority of the applications of RS codes use syndrome- 
based decoding algorithms such as the Berlekamp-Massey 
algorithm (BMA) [2] or the extended Euclidean algorithm 
(EEA) [3]. Alternative hard decision decoding methods for 
RS codes without using syndromes were considered in [4]- 
[6]. As pointed out in [7], [8], these algorithms belong to 
the class of frequency-domain algorithms and are related to 
the Welch-Berlekamp algorithm [9]. In contrast to syndrome- 
based decoding algorithms, these algorithms do not compute 
syndromes and avoid the Chien search and Forney's formula. 
Clearly, this difference leads to the question whether these al- 
gorithms offer lower complexity than syndrome-based decod- 
ing, especially when fast Fourier transform (FFT) techniques 
are applied [6]. 

The material in this paper was presented in part at the IEEE Workshop on

Asymptotic complexity of syndromeless decoding was analyzed in [6],
and in [7] it was concluded that syndromeless decoding has the same
asymptotic complexity $O(n \log^2 n)$ as syndrome-based decoding [10]
(all logarithms in this paper are to base two). However, existing
asymptotic complexity analysis is limited in several aspects. For
example, for RS codes over Fermat fields GF($2^{2^r} + 1$)
and other prime fields [5], [6], efficient multiplicative FFT
techniques lead to an asymptotic complexity of $O(n \log^2 n)$.
However, such FFT techniques do not apply to characteristic-2
fields, and hence this complexity is not applicable to RS
codes over characteristic-2 fields. For RS codes over arbitrary
fields, the asymptotic complexity of syndromeless decoding
based on multiplicative FFT techniques was shown to be
$O(n \log^2 n \log\log n)$ [6]. Although these techniques are applicable
to RS codes over characteristic-2 fields, the complexity has large
coefficients, and multiplicative FFT techniques are less efficient
than fast implementation based on additive FFT for RS
codes with moderate block lengths [6], [11], [12]. As such,
asymptotic complexity analysis provides little help to practical
applications.

In this paper, we analyze the complexity of syndromeless 
decoding and compare it to that of syndrome-based decoding. 
Aiming to provide guidelines to system designers, we focus 
on the decoding complexity of RS codes over GF($2^m$). Since
RS codes in practice have moderate lengths, our complexity
analysis provides not only the coefficients for the most
significant terms, but also the following terms. Due to their
moderate lengths, our comparison is based on two types of 
implementations of syndromeless decoding and syndrome- 
based decoding: direct implementation and fast implemen- 
tation based on FFT techniques. Direct implementations are 
often efficient when decoding RS codes with moderate lengths 
and have widespread applications; thus, we consider both 
computational complexities, in terms of field operations, and 
hardware costs and throughputs. For fast implementations, we 
consider their computational complexities only and their hard- 
ware implementations are beyond the scope of this paper. We 
use additive FFT techniques based on Cantor's approach [13] 
since this approach achieves small coefficients [6], [11] and 
hence is more suitable for moderate lengths. In contrast to 
some previous works [12], [14], which count field multiplica- 
tions and additions together, we differentiate the multiplicative 
and additive complexities in our analysis. 

The main contributions of the paper are:

• We derive a tighter bound on the complexities of fast
polynomial multiplication based on Cantor's approach;

• We also obtain a tighter bound on the complexity of
the fast extended Euclidean algorithm (FEEA) for general
partial greatest common divisor (GCD) computation;

• We evaluate the complexities of syndromeless decoding
based on different implementation approaches and
compare them with their counterparts of syndrome-based
decoding; both errors-only and errors-and-erasures
decoding are considered;

• We compare the hardware costs and throughputs of direct
implementations for syndromeless decoders with those
for syndrome-based decoders.

The rest of the paper is organized as follows. To make
this paper self-contained, in Section II we briefly review FFT
algorithms over finite fields, fast algorithms for polynomial
multiplication and division over GF($2^m$), the FEEA, and
syndromeless decoding algorithms. Section III presents both
computational complexity and decoder architectures of direct
implementations of syndromeless decoding, and compares them
with their counterparts for syndrome-based decoding algorithms.
Section IV compares the computational complexity of
fast implementations of syndromeless decoding with that of
syndrome-based decoding. In Section V, case studies on two
RS codes are provided and errors-and-erasures decoding is
discussed. The conclusions are given in Section VI.

II. Background 

A. Fast Fourier Transform Over Finite Fields 

For any $n$ ($n \mid q - 1$) distinct elements $a_0, a_1, \ldots, a_{n-1} \in$
GF($q$), the transform from $f = (f_0, f_1, \ldots, f_{n-1})^T$ to
$F = (f(a_0), f(a_1), \ldots, f(a_{n-1}))^T$, where
$f(x) = \sum_{i=0}^{n-1} f_i x^i \in$ GF($q$)$[x]$, is called a discrete
Fourier transform (DFT), denoted by $F = \mathrm{DFT}(f)$. Accordingly,
$f$ is called the inverse DFT of $F$, denoted by $f = \mathrm{IDFT}(F)$.
An asymptotically fast Fourier transform (FFT) algorithm over GF($2^m$)
was proposed in [15]. The reduced-complexity cyclotomic FFT (CFFT) was
shown to be efficient for moderate lengths in [16].

B. Polynomial Multiplication Over GF($2^m$) By Cantor's Approach

A fast polynomial multiplication algorithm using additive
FFT was proposed by Cantor [13] for GF($q^{q^m}$), where $q$ is
prime, and it was generalized to GF($q^m$) in [11]. Instead of
evaluating and interpolating over the multiplicative subgroups 
as in multiplicative FFT techniques, Cantor's approach uses 
additive subgroups. Cantor's approach relies on two algo- 
rithms: multipoint evaluation (MPE) [11, Algorithm 3.1] and 
multipoint interpolation (MPI) [11, Algorithm 3.2]. 

Suppose the degree of the product of two polynomials
over GF($2^m$) is less than $h$ ($h \le 2^m$). The product can be
obtained as follows. First, the two operand polynomials are
evaluated using the MPE algorithm. The evaluation results
are then multiplied point-wise. Finally, the product polynomial
is obtained by using the MPI algorithm to interpolate the
point-wise multiplication results. The polynomial multiplication
requires at most $\frac{3}{2} h \log^2 h + \frac{15}{2} h \log h + 8h$
multiplications over GF($2^m$) and
$\frac{3}{2} h \log^2 h + \frac{29}{2} h \log h + 4h + 9$ additions
over GF($2^m$) [11]. For simplicity, henceforth in this paper,
all arithmetic operations are over GF($2^m$) unless specified
otherwise.



C. Polynomial Division By Newton Iteration 

Suppose $a, b \in$ GF($q$)$[x]$ are two polynomials of degrees
$d_0 + d_1$ and $d_1$ ($d_0, d_1 > 0$), respectively. To find the quotient
polynomial $q$ and the remainder polynomial $r$ satisfying
$a = qb + r$ where $\deg r < d_1$, a fast polynomial division
algorithm is available [12]. Let $\mathrm{rev}_h(a) = x^h a(\frac{1}{x})$. The fast
algorithm first computes the inverse of $\mathrm{rev}_{d_1}(b)$ mod $x^{d_0+1}$
by Newton iteration. Then the reverse quotient is given by
$q^* = \mathrm{rev}_{d_0+d_1}(a)\,\mathrm{rev}_{d_1}(b)^{-1}$ mod $x^{d_0+1}$. Finally, the actual
quotient and remainder are given by $q = \mathrm{rev}_{d_0}(q^*)$ and
$r = a - qb$.

Thus, the complexity of polynomial division with remainder
of a polynomial $a$ of degree $d_0 + d_1$ by a monic polynomial $b$
of degree $d_1$ is at most $4M(d_0) + M(d_1) + O(d_1)$
multiplications/additions when $d_1 \ge d_0$ [12, Theorem 9.6], where $M(h)$
stands for the numbers of multiplications/additions required to
multiply two polynomials of degree less than $h$.
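
The reversal-and-Newton-iteration division described above can be sketched in a few lines. This is only an illustrative sketch under simplifying assumptions: it works over GF(2)[x] (not GF($2^m$)[x]), stores a polynomial as a Python integer whose bit $i$ is the coefficient of $x^i$, and uses a schoolbook carry-less product in place of a fast multiplication. The function names (`clmul`, `rev`, `inv_series`, `fast_divmod`) are hypothetical. In characteristic 2 the Newton step $g \leftarrow 2g - fg^2$ reduces to $g \leftarrow fg^2$.

```python
def clmul(a, b):
    """Carry-less product of two GF(2)[x] polynomials stored as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def rev(a, d):
    """rev_d(a) = x^d * a(1/x): reverse the d+1 low-order coefficients."""
    result = 0
    for i in range(d + 1):
        if (a >> i) & 1:
            result |= 1 << (d - i)
    return result


def inv_series(f, length):
    """Inverse of f modulo x^length by Newton iteration (f(0) must be 1)."""
    g, prec = 1, 1
    while prec < length:
        prec *= 2
        # Characteristic 2: g <- f * g^2 doubles the number of correct terms.
        g = clmul(f, clmul(g, g)) & ((1 << prec) - 1)
    return g & ((1 << length) - 1)


def fast_divmod(a, b):
    """Return (q, r) with a = q*b + r and deg r < deg b, for b != 0."""
    da, db = a.bit_length() - 1, b.bit_length() - 1
    if da < db:
        return 0, a
    d0 = da - db
    mask = (1 << (d0 + 1)) - 1
    b_inv = inv_series(rev(b, db), d0 + 1)          # rev(b)^{-1} mod x^{d0+1}
    q_star = clmul(rev(a, da) & mask, b_inv) & mask  # reverse quotient
    q = rev(q_star, d0)
    r = a ^ clmul(q, b)                              # r = a - q*b over GF(2)
    return q, r


if __name__ == "__main__":
    # (x^5 + x^3 + 1) divided by (x^2 + x + 1)
    print(fast_divmod(0b101001, 0b111))
```

The cost structure matches the text: one power-series inversion, one truncated product for the reverse quotient, and one product to recover the remainder.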

D. Fast Extended Euclidean Algorithm 

Let $r_0$ and $r_1$ be two monic polynomials with $\deg r_0 > \deg r_1$,
and assume $s_0 = t_1 = 1$, $s_1 = t_0 = 0$. Step $i$
($i = 1, 2, \ldots, l$) of the EEA computes $\rho_{i+1} r_{i+1} = r_{i-1} - q_i r_i$,
$\rho_{i+1} s_{i+1} = s_{i-1} - q_i s_i$, and $\rho_{i+1} t_{i+1} = t_{i-1} - q_i t_i$ so that
the $r_i$ form a sequence of monic polynomials with strictly decreasing
degrees. If the GCD of $r_0$ and $r_1$ is desired, the EEA
terminates when $r_{l+1} = 0$. For $1 \le i \le l$, $R_i = Q_i \cdots Q_1 R_0$,
where $Q_i = \begin{bmatrix} 0 & 1 \\ \frac{1}{\rho_{i+1}} & -\frac{q_i}{\rho_{i+1}} \end{bmatrix}$
and $R_0 = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$. Then it can be
easily verified that $R_i = \begin{bmatrix} s_i & t_i \\ s_{i+1} & t_{i+1} \end{bmatrix}$
for $0 \le i \le l$. In RS decoding, the EEA stops when the degree of $r_i$
falls below a certain threshold for the first time, and we refer to this as
partial GCD.

The FEEA in [12], [17] costs no more than $(22M(h) + O(h)) \log h$
multiplications/additions when $n_0 \le 2h$ [14].

E. Syndrome-based and Syndromeless Decoding 

Over a finite field GF($q$), suppose $a_0, a_1, \ldots, a_{n-1}$ are $n$
($n \le q$) distinct elements and $g_0(x) = \prod_{i=0}^{n-1} (x - a_i)$. Let us
consider an RS code over GF($q$) with length $n$, dimension
$k$, and minimum Hamming distance $d = n - k + 1$. A
message polynomial $m(x)$ of degree less than $k$ is encoded
to a codeword $(c_0, c_1, \ldots, c_{n-1})$ with $c_i = m(a_i)$, and the
received vector is given by $r = (r_0, r_1, \ldots, r_{n-1})$.

The syndrome-based hard decision decoding consists of the 
following steps: syndrome computation, key equation solver, 
the Chien search, and Forney's formula. Further details are 
omitted, and interested readers are referred to [1], [2], [18]. 
We also consider the following two syndromeless algorithms: 

Algorithm 1: [4], [5], [6, Algorithm 1]

1.1 Interpolation: Construct a polynomial $g_1(x)$ with
$\deg g_1(x) < n$ such that $g_1(a_i) = r_i$ for
$i = 0, 1, \ldots, n-1$.

1.2 Partial GCD: Apply the EEA to $g_0(x)$ and $g_1(x)$, and
find $g(x)$ and $v(x)$ that maximize $\deg g(x)$ while satisfying
$v(x) g_1(x) \equiv g(x) \bmod g_0(x)$ and $\deg g(x) < \frac{n+k}{2}$.

1.3 Message Recovery: If $v(x) \mid g(x)$, the message polynomial
is recovered by $m(x) = \frac{g(x)}{v(x)}$; otherwise output
"decoding failure."

Algorithm 2: [6, Algorithm 1a]

2.1 Interpolation: Construct a polynomial $g_1(x)$ with
$\deg g_1(x) < n$ such that $g_1(a_i) = r_i$ for
$i = 0, 1, \ldots, n-1$.

2.2 Partial GCD: Find $s_0(x)$ and $s_1(x)$ satisfying
$g_0(x) = x^{n-d+1} s_0(x) + r_0(x)$ and $g_1(x) = x^{n-d+1} s_1(x) + r_1(x)$,
where $\deg r_0(x) \le n - d$ and $\deg r_1(x) \le n - d$.
Apply the EEA to $s_0(x)$ and $s_1(x)$, and stop when the
remainder $g(x)$ has degree less than $\frac{d-1}{2}$. Thus, we have
$v(x) s_1(x) + u(x) s_0(x) = g(x)$.

2.3 Message Recovery: If $v(x) \nmid g_0(x)$, output "decoding
failure"; otherwise, first compute $q(x) = \frac{g_0(x)}{v(x)}$, and then
obtain $m'(x) = g_1(x) + q(x) u(x)$. If $\deg m'(x) < k$,
output $m'(x)$; otherwise output "decoding failure."

Compared with Algorithm 1, the partial GCD step of Algorithm 2
is simpler but its message recovery step is more
complex [6].
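
To make the three steps of Algorithm 1 concrete, the following is a minimal sketch of a direct (non-FFT) implementation. It makes several assumptions that are not in the paper: it works over the small prime field GF(17) rather than GF($2^m$) purely to keep the arithmetic short, it stops the EEA at the first remainder of degree below $(n+k)/2$, and all function names are hypothetical. Polynomials are lists of coefficients, lowest degree first.

```python
P = 17  # assumed small prime field GF(17); the paper itself targets GF(2^m)


def trim(a):
    """Drop zero leading coefficients in place and return the list."""
    while a and a[-1] == 0:
        a.pop()
    return a


def add(a, b):
    n = max(len(a), len(b))
    a, b = a + [0] * (n - len(a)), b + [0] * (n - len(b))
    return trim([(x + y) % P for x, y in zip(a, b)])


def sub(a, b):
    return add(a, [(-y) % P for y in b])


def mul(a, b):
    if not a or not b:
        return []
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % P
    return trim(out)


def divmod_poly(a, b):
    """Schoolbook division: (q, r) with a = q*b + r and deg r < deg b."""
    a = trim(a[:])
    q = [0] * max(len(a) - len(b) + 1, 0)
    inv = pow(b[-1], -1, P)
    while len(a) >= len(b):
        c, s = a[-1] * inv % P, len(a) - len(b)
        q[s] = c
        for i, y in enumerate(b):
            a[s + i] = (a[s + i] - c * y) % P
        trim(a)
    return trim(q), a


def evaluate(f, x):
    """Horner's rule."""
    acc = 0
    for c in reversed(f):
        acc = (acc * x + c) % P
    return acc


def interpolate(xs, ys):
    """Step 1.1 by direct Lagrange interpolation."""
    g = []
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        num, den = [1], 1
        for j, xj in enumerate(xs):
            if j != i:
                num = mul(num, [(-xj) % P, 1])
                den = den * (xi - xj) % P
        scale = yi * pow(den, -1, P) % P
        g = add(g, [c * scale % P for c in num])
    return g


def decode_algorithm1(xs, received, k):
    """Return the message polynomial, or None on decoding failure."""
    n = len(xs)
    g0 = [1]
    for a in xs:
        g0 = mul(g0, [(-a) % P, 1])
    g1 = interpolate(xs, received)
    # Step 1.2: EEA on (g0, g1), tracking only the cofactor v of g1.
    r_prev, r_cur, v_prev, v_cur = g0, g1, [], [1]
    while 2 * (len(r_cur) - 1) >= n + k:
        q, rem = divmod_poly(r_prev, r_cur)
        r_prev, r_cur = r_cur, rem
        v_prev, v_cur = v_cur, sub(v_prev, mul(q, v_cur))
    # Step 1.3: m(x) = g(x) / v(x) if the division is exact.
    m, rem = divmod_poly(r_cur, v_cur)
    if rem or len(m) > k:
        return None
    return m


if __name__ == "__main__":
    xs, k, msg = list(range(10)), 4, [3, 0, 7, 1]
    word = [evaluate(msg, a) for a in xs]
    word[1] = (word[1] + 5) % P  # inject one symbol error
    print(decode_algorithm1(xs, word, k))
```

With $n = 10$ and $k = 4$ this hypothetical code corrects up to $t = 3$ symbol errors; no syndromes, Chien search, or Forney's formula appear anywhere in the flow.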

III. Direct Implementation of Syndromeless 
Decoding 

A. Complexity Analysis 

We analyze the complexity of direct implementation of
Algorithms 1 and 2. For simplicity, we assume $n - k$ is even
and hence $d - 1 = 2t$.

First, $g_1(x)$ in Steps 1.1 and 2.1 is given by IDFT($r$).
Direct implementation of Steps 1.1 and 2.1 follows Horner's
rule, and requires $n(n - 1)$ multiplications and $n(n - 1)$
additions [19].
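
The $n(n-1)$ count can be checked by instrumenting a Horner evaluation. The sketch below assumes GF($2^4$) built with the primitive polynomial $x^4 + x + 1$ and $n = 15$ evaluation points; both choices, and the names `gf16_mul` and `horner_dft`, are illustrative assumptions rather than anything fixed by the paper.

```python
def gf16_mul(a, b):
    """Multiply in GF(2^4) modulo x^4 + x + 1 (assumed field polynomial)."""
    result = 0
    for _ in range(4):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13  # reduce by x^4 + x + 1
    return result


def horner_dft(f, points):
    """Evaluate f (lowest degree first) at every point, counting operations."""
    mults = adds = 0
    values = []
    for x in points:
        acc = f[-1]
        for c in reversed(f[:-1]):
            acc = gf16_mul(acc, x) ^ c  # one multiplication, one addition
            mults += 1
            adds += 1
        values.append(acc)
    return values, mults, adds


if __name__ == "__main__":
    # The 15 nonzero field elements, generated as powers of alpha = x.
    points, p = [], 1
    for _ in range(15):
        points.append(p)
        p = gf16_mul(p, 2)
    f = list(range(1, 16))  # an arbitrary polynomial with n = 15 coefficients
    values, mults, adds = horner_dft(f, points)
    print(mults, adds)  # both equal n(n - 1) = 210
```

Each of the $n$ points costs $n - 1$ multiply-add steps, which is where the $n(n-1)$ figure comes from.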

Steps 1.2 and 2.2 both use the EEA. The Sugiyama tower
(ST) [3], [20] is well known as an efficient direct implementation
of the EEA. For Algorithm 1, the ST is initialized by $g_1(x)$
and $g_0(x)$, whose degrees are at most $n$. Since the number of
iterations is $2t$, Step 1.2 requires $4t(n + 2)$ multiplications
and $2t(n + 1)$ additions. For Algorithm 2, the ST is initialized
by $s_0(x)$ and $s_1(x)$, whose degrees are at most $2t$, and the
iteration number is at most $2t$.

Step 1.3 requires one polynomial division, which can be
implemented by using $k$ iterations of cross multiplications in
the ST. Since $v(x)$ is actually the error locator polynomial [6],
$\deg v(x) \le t$. Hence, this requires $k(k + 2t + 2)$ multiplications
and $k(t + 2)$ additions. However, the result of the polynomial
division is scaled by a nonzero constant. That is, cross
multiplications lead to $\hat{m}(x) = a\,m(x)$. To remove the scaling factor
$a$, we can first compute
$\frac{1}{a} = \frac{\mathrm{lc}(g(x))}{\mathrm{lc}(\hat{m}(x))\,\mathrm{lc}(v(x))}$, where $\mathrm{lc}(f)$
denotes the leading coefficient of a polynomial $f$, and then
obtain $m(x) = \frac{1}{a}\hat{m}(x)$. This process requires one inversion
and $k + 2$ multiplications.

Step 2.3 involves one polynomial division, one polynomial
multiplication, and one polynomial addition, and their complexities
depend on the degrees of $v(x)$ and $u(x)$, denoted
as $d_v$ and $d_u$, respectively. In the polynomial division, let the
result of the ST be $\hat{q}(x) = a\,q(x)$. The scaling factor is recovered
by $\frac{1}{a} = \frac{1}{\mathrm{lc}(\hat{q}(x))\,\mathrm{lc}(v(x))}$. Thus it requires one inversion,
$(n - d_v + 1)(n + d_v + 3) + n - d_v + 2$ multiplications, and
$(n - d_v + 1)(d_v + 2)$ additions to obtain $q(x)$. The polynomial
multiplication needs $(n - d_v + 1)(d_u + 1)$ multiplications
and $(n - d_v + 1)(d_u + 1) - (n - d_v + d_u + 1)$ additions,
and the polynomial addition needs $n$ additions since $g_1(x)$
has degree at most $n - 1$. The total complexity of Step 2.3
includes $(n - d_v + 1)(n + d_v + d_u + 5) + 1$ multiplications,
$(n - d_v + 1)(d_v + d_u + 2) + n - d_u$ additions, and one inversion.
Consider the worst case for multiplicative complexity, where
$d_v$ should be as small as possible. But $d_v > d_u$, so the highest
multiplicative complexity is $(n - d_u)(n + 2d_u + 6) + 1$, which
is maximized when $d_u = \frac{n-6}{4}$. We also know that $d_u < d_v \le t$.

Let $R$ denote the code rate. For RS codes with $R > \frac{1}{2}$,
the maximum complexity is $n^2 + nt - 2t^2 + 5n - 2t + 5$
multiplications, $2nt - 2t^2 + 2n + 2$ additions, and one inversion.
For codes with $R \le \frac{1}{2}$, the maximum complexity is
$\frac{9}{8}n^2 + \frac{9}{2}n + \frac{11}{2}$ multiplications,
$\frac{3}{8}n^2 + \frac{3}{2}n + \frac{3}{2}$ additions, and one
inversion.
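
The worst-case multiplication counts for Step 2.3 can be checked with a short calculation. The sketch below treats $d_u$ as a real variable and assumes the smallest admissible $d_v$, namely $d_v = d_u + 1$; the name $C(d_u)$ is introduced here only for illustration.

```latex
\begin{align*}
C(d_u) &= (n - d_u)(n + 2d_u + 6) + 1, \\
\frac{dC}{dd_u} &= -(n + 2d_u + 6) + 2(n - d_u) = n - 4d_u - 6 = 0
  \;\Longrightarrow\; d_u = \frac{n - 6}{4}, \\
C\!\left(\tfrac{n-6}{4}\right) &= \frac{3n + 6}{4} \cdot \frac{3n + 6}{2} + 1
  = \frac{9}{8}n^2 + \frac{9}{2}n + \frac{11}{2}, \\
C(t - 1) &= (n - t + 1)(n + 2t + 4) + 1
  = n^2 + nt - 2t^2 + 5n - 2t + 5 .
\end{align*}
```

The unconstrained maximum applies when $t - 1$ can reach $\frac{n-6}{4}$ (low rates); otherwise $C$ is increasing over the admissible range and the maximum is attained at the boundary $d_u = t - 1$ (high rates).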

Table I lists the complexity of direct implementation of
Algorithms 1 and 2 in terms of operations in GF($2^m$). The
complexity of syndrome-based decoding is given in Table II.
The numbers for syndrome computation, the Chien search, and
Forney's formula are from [21]. We assume the EEA is used
for the key equation solver since it was shown to be equivalent
to the BMA [22]. The ST is used to implement the EEA. Note
that the overall complexity of syndrome-based decoding can
be reduced by sharing computations between the Chien search
and Forney's formula. However, this is not taken into account
in Table II.

B. Complexity Comparison 

For any application with fixed parameters $n$ and $k$, the
comparison between the algorithms is straightforward using
the complexities in Tables I and II. Below we try to determine
which algorithm is more suitable for a given code rate.
The comparison between different algorithms is complicated
by three different types of field operations. However, the
complexity is dominated by the number of multiplications:
in hardware implementation, both multiplication and inversion
over GF($2^m$) require an area-time complexity of $O(m^2)$ [23],
whereas an addition requires an area-time complexity of
$O(m)$; the complexity due to inversions is negligible since
the required number of inversions is much smaller than
that of multiplications; the numbers of multiplications and
additions are both $O(n^2)$. Thus, we focus on the number of
multiplications for simplicity.

Since $t = \frac{1-R}{2}n$ and $k = Rn$, the multiplicative complexities
of Algorithms 1 and 2 are $(3 - R)n^2 + (3 - R)n + 2$ and
$\frac{1}{2}(3R^2 - 7R + 8)n^2 + (7 - 3R)n + 5$, respectively, while the
complexity of syndrome-based decoding is $\frac{5R^2 - 13R + 8}{2}n^2 +
(2 - 3R)n$. It is easy to verify that in all these complexities,
the quadratic and linear coefficients are of the same order of
magnitude; hence, we consider only the quadratic terms.
Considering only the quadratic terms, Algorithm 1 is less efficient
than syndrome-based decoding when $R > \frac{1}{5}$. If the Chien
search and Forney's formula share computations, this threshold
will be even lower. Comparing the highest terms, Algorithm 2
is less efficient than the syndrome-based algorithm regardless
TABLE I
Direct Implementation Complexities of Syndromeless Decoding Algorithms

Step              Algorithm    Multiplications                  Additions                      Inversions
Interpolation     1 and 2      n(n - 1)                         n(n - 1)                       0
Partial GCD       Algorithm 1  4t(n + 2)                        2t(n + 1)                      0
Partial GCD       Algorithm 2  4t(2t + 2)                       2t(2t + 1)                     0
Message Recovery  Algorithm 1  (k + 2)(k + 1) + 2kt             k(t + 2)                       1
Message Recovery  Algorithm 2  n^2 + nt - 2t^2 + 5n - 2t + 5    2nt - 2t^2 + 2n + 2            1
Total             Algorithm 1  2n^2 + 2nt + 2n + 2t + 2         n^2 + 3nt - 2t^2 + n - 2t      1
Total             Algorithm 2  2n^2 + nt + 6t^2 + 4n + 6t + 5   n^2 + 2nt + 2t^2 + n + 2t + 2  1

TABLE II
Direct Implementation Complexity of Syndrome-Based Decoding

Step                  Multiplications       Additions         Inversions
Syndrome Computation  2t(n - 1)             2t(n - 1)         0
Key Equation Solver   4t(2t + 2)            2t(2t + 1)        0
Chien Search          n(t - 1)              nt                0
Forney's Formula      2t^2                  t(2t - 1)         t
Total                 3nt + 10t^2 - n + 6t  3nt + 6t^2 - t    t



of $R$. It is easy to verify that the most significant term of the
difference between Algorithms 1 and 2 is $-\frac{(R-1)(3R-2)}{2}n^2$. So
when implemented directly, Algorithm 1 is less efficient than
Algorithm 2 when $R > \frac{2}{3}$. Thus, Algorithm 1 is more suitable
for codes with very low rate, while syndrome-based decoding
is the most efficient for high rate codes.
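
The comparison can be reproduced numerically from the totals in Tables I and II. The sketch below is illustrative only: the (255, 223) code is an assumed example (it is not singled out in this section), the function names are hypothetical, and the Algorithm 2 total uses the worst case stated for rates above one half.

```python
from fractions import Fraction


def mults_algorithm1(n, t):
    """Total multiplications of Algorithm 1 (Table I)."""
    return 2 * n * n + 2 * n * t + 2 * n + 2 * t + 2


def mults_algorithm2(n, t):
    """Total multiplications of Algorithm 2 (Table I, worst case for R > 1/2)."""
    return 2 * n * n + n * t + 6 * t * t + 4 * n + 6 * t + 5


def mults_syndrome_based(n, t):
    """Total multiplications of syndrome-based decoding (Table II)."""
    return 3 * n * t + 10 * t * t - n + 6 * t


def quadratic_coefficients(rate):
    """Coefficients of n^2 after substituting t = (1 - R) n / 2 and k = R n."""
    r = Fraction(rate)
    return {
        "algorithm1": 3 - r,
        "algorithm2": (3 * r * r - 7 * r + 8) / 2,
        "syndrome": (5 * r * r - 13 * r + 8) / 2,
    }


if __name__ == "__main__":
    n, k = 255, 223  # assumed example code
    t = (n - k) // 2
    print("Algorithm 1:   ", mults_algorithm1(n, t))
    print("Algorithm 2:   ", mults_algorithm2(n, t))
    print("Syndrome-based:", mults_syndrome_based(n, t))
    # Crossover rates: the quadratic coefficients coincide at R = 1/5 and R = 2/3.
    print(quadratic_coefficients(Fraction(1, 5)))
    print(quadratic_coefficients(Fraction(2, 3)))
```

For this assumed high rate code, both syndromeless totals are roughly an order of magnitude above the syndrome-based total, consistent with the conclusion above.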

C. Hardware Costs, Latency, and Throughput 

We have compared the computational complexities of
syndromeless decoding algorithms with those of syndrome-based
algorithms. Now we compare these two types of decoding
algorithms from a hardware perspective: we compare the
hardware costs, latency, and throughput of decoder architectures
based on direct implementations of these algorithms.
Since our goal is to compare syndrome-based algorithms
with syndromeless algorithms, we select our architectures so
that the comparison is on a level field. Thus, among various
decoder architectures available for syndrome-based decoders
in the literature, we consider the hypersystolic architecture
in [20]. Not only is it an efficient architecture for
syndrome-based decoders, but some of its functional units can also be
easily adapted to implement syndromeless decoders. Thus, decoder
architectures for both types of decoding algorithms have the
same structure and share some functional units; this
allows us to focus on the difference between the two types
of algorithms. For the same reason, we do not try to optimize
the hardware costs, latency, or throughput using circuit level
techniques, since such techniques would benefit the architectures
for both types of decoding algorithms in a similar fashion and
hence would not affect the comparison.

The hypersystolic architecture [20] contains three functional
units: the power sums tower (PST) computing the syndromes,
the ST solving the key equation, and the correction tower (CT)
performing the Chien search and Forney's formula. The PST
consists of $2t$ systolic cells, each of which comprises one
multiplier, one adder, five registers, and one multiplexer. The
ST has $\delta + 1$ ($\delta$ is the maximal degree of the input polynomials)
systolic cells, each of which contains one multiplier, one
adder, five registers, and seven multiplexers. The latency of
the ST is $6\gamma$ clock cycles [20], where $\gamma$ is the number of
iterations. For the syndrome-based decoder architecture, $\delta$
and $\gamma$ are both $2t$. The CT consists of $3t + 1$ evaluation
cells, two delay cells, and two joiner cells, which also
perform inversions. Each evaluation cell needs one multiplier,
one adder, four registers, and one multiplexer. Each delay
cell needs one register. The two joiner cells altogether need
two multipliers, one inverter, and four registers. Table III
summarizes the hardware costs of the decoder architecture for
syndrome-based decoders described above. For each functional
unit, we also list the latency (in clock cycles), as well as
the number of clock cycles it needs to process one received
word, which is proportional to the inverse of the throughput. In
theory, the computational complexities of the steps of RS decoding
depend on the received word, and the total complexity is
obtained by first computing the sum of complexities for all
the steps and then considering the worst case scenario (cf.
Section III-A). In contrast, the hardware costs, latency, and
throughput of every functional unit are dominated by the worst
case scenario; the numbers in Table III all correspond to the
worst case scenario. The critical path delay (CPD) is the same,
$T_{mult} + T_{add} + T_{mux}$, for the PST, ST, and CT. In addition to
the registers required by the PST, ST, and CT, the total number
of registers in Table III also accounts for the registers needed
by the delay line called Main Street [20].

Both the PST and the ST can be adapted to implement
decoder architectures for syndromeless decoding algorithms.
Similar to syndrome computation, interpolation in syndromeless
decoders can be implemented by Horner's rule, and thus
the PST can be easily adapted to implement this step. For
the architectures based on syndromeless decoding, the PST
contains $n$ cells, and the hardware costs of each cell remain
the same. The partial GCD is implemented by the ST. The ST
can implement the polynomial division in message recovery
as well. In Step 1.3, the maximum polynomial degree of the
polynomial division is $k + t$ and the iteration number is at
most $k$. As mentioned in Section III-A, the degree of $v(x)$ in
Step 2.3 ranges from 1 to $t$. In the polynomial division $\frac{g_0(x)}{v(x)}$,
the maximum polynomial degree is $n$ and the iteration number
TABLE III
Decoder Architecture Based on Syndrome-Based Decoding (CPD is T_mult + T_add + T_mux)

Unit                  Multipliers  Adders  Inverters  Registers     Muxes    Latency  Throughput^-1
Syndrome Computation  2t           2t      0          10t           2t       n + 6t   6t
Key Equation Solver   2t + 1       2t + 1  0          10t + 5       14t + 7  12t      12t
Correction            3t + 3       3t + 1  1          12t + 10      3t + 1   3t       3t
Total                 7t + 4       7t + 2  1          n + 53t + 15  19t + 8  n + 21t  12t



is at most $n - 1$. Given the maximum polynomial degree and
iteration number, the hardware costs and latency for the ST
can be determined as for the syndrome-based architecture.

The other operations of syndromeless decoders do not have
corresponding functional units available in the hypersystolic
architecture, and we choose to implement them in a straightforward
way. In the polynomial multiplication $q(x)u(x)$, $u(x)$ has
degree at most $t - 1$ and the product has degree at most $n - 1$.
Thus it can be done by $n$ multiply-and-accumulate circuits and $n$
registers in $t$ cycles (see, e.g., [24]). The polynomial addition
in Step 2.3 can be done in one clock cycle with $n$ adders and $n$
registers. To remove the scaling factor, Step 1.3 is implemented
in four cycles with at most one inverter, $k + 2$ multipliers, and
$k + 3$ registers; Step 2.3 is implemented in three cycles with
at most one inverter, $n + 1$ multipliers, and $n + 2$ registers.
We summarize the hardware costs, latency, and throughput
of the decoder architectures based on Algorithms 1 and 2 in
Table IV.

Now we compare the hardware costs of the three decoder
architectures based on Tables III and IV. The hardware costs
are measured by the numbers of various basic circuit elements.
All three decoder architectures need only one inverter. The
syndrome-based decoder architecture requires fewer
multiplexers than the decoder architecture based on Algorithm 1
regardless of the rate, and fewer multipliers, adders, and
registers when $R > \frac{1}{2}$. The syndrome-based decoder architecture
requires fewer registers than the decoder architecture based
on Algorithm 2 when $R > \frac{21}{43}$, and fewer multipliers, adders,
and multiplexers regardless of the rate. Thus for high rate
codes, the syndrome-based decoder has lower hardware costs
than syndromeless decoders. The decoder architecture based
on Algorithm 1 requires fewer multipliers and adders than that
based on Algorithm 2 regardless of the rate, but more registers
and multiplexers when $R > \frac{9}{17}$.

In these algorithms, each step starts with the results of the 
previous step. Due to this data dependency, their corresponding 
functional units have to operate in a pipelined fashion. Thus 
the decoding latency is simply the sum of the latency of 
all the functional units. The decoder architecture based on
Algorithm 2 has the longest latency, regardless of the rate. The
syndrome-based decoder architecture has shorter latency than
the decoder architecture based on Algorithm 1 when $R > \frac{1}{7}$.

All three decoders have the same CPD, so the throughput 
is determined by the number of clock cycles. Since the 
functional units in each decoder architecture are pipelined, 
the throughput of each decoder architecture is determined by 
the functional unit that requires the largest number of cycles. 
Regardless of the rate, the decoder based on Algorithm 2 has
the lowest throughput. When $R > \frac{1}{2}$, the syndrome-based
decoder architecture has higher throughput than the decoder
architecture based on Algorithm 1. When the rate is lower,
they have the same throughput.

Hence for high rate RS codes, the syndrome-based de- 
coder architecture requires less hardware and achieves higher 
throughput and shorter latency than those based on syndrome- 
less decoding algorithms. 
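
As a quick numerical illustration of this conclusion, the totals of Tables III and IV can be evaluated for a specific code. The sketch below is hypothetical: the (255, 223) code is an assumed example, the dictionary keys and function names are made up for illustration, and `cycles_per_word` stands for the inverse-throughput column.

```python
def syndrome_based(n, k):
    """Totals of Table III (hypersystolic syndrome-based decoder)."""
    t = (n - k) // 2
    return {
        "multipliers": 7 * t + 4,
        "adders": 7 * t + 2,
        "registers": n + 53 * t + 15,
        "muxes": 19 * t + 8,
        "latency": n + 21 * t,
        "cycles_per_word": 12 * t,
    }


def algorithm1(n, k):
    """Totals of Table IV for the decoder based on Algorithm 1."""
    t = (n - k) // 2
    return {
        "multipliers": 2 * n + 2 * k + t + 4,
        "adders": 2 * n + k + t + 2,
        "registers": 10 * n + 6 * k + 5 * t + 13,
        "muxes": 8 * n + 7 * k + 7 * t + 14,
        "latency": 4 * n + 6 * k + 12 * t + 4,
        "cycles_per_word": 6 * k,
    }


def algorithm2(n, k):
    """Totals of Table IV for the decoder based on Algorithm 2."""
    t = (n - k) // 2
    return {
        "multipliers": 4 * n + 2 * t + 3,
        "adders": 4 * n + 2 * t + 2,
        "registers": 12 * n + 10 * t + 12,
        "muxes": 8 * n + 14 * t + 14,
        "latency": 10 * n + 13 * t - 2,
        "cycles_per_word": 6 * n,
    }


if __name__ == "__main__":
    n, k = 255, 223  # assumed example code
    for name, arch in (("syndrome-based", syndrome_based),
                       ("Algorithm 1", algorithm1),
                       ("Algorithm 2", algorithm2)):
        print(name, arch(n, k))
```

Because all three architectures share the same critical path delay, comparing `cycles_per_word` directly compares throughput.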

IV. Fast Implementation of Syndromeless 
Decoding 

In this section, we implement the three steps of
Algorithms 1 and 2 (interpolation, partial GCD, and message
recovery) by fast algorithms described in Section II and
evaluate their complexities. Since both the polynomial division by
Newton iteration and the FEEA depend on efficient polynomial 
multiplication, the decoding complexity relies on the com- 
plexity of polynomial multiplication. Thus, in addition to field 
multiplications and additions, the complexities in this section 
are also expressed in terms of polynomial multiplications. 

A. Polynomial Multiplication 

We first derive a tighter bound on the complexity of the fast 
polynomial multiplication based on Cantor's approach. 

Let the degree of the product of two polynomials be less
than $n$. The polynomial multiplication can be done by two
FFTs and one inverse FFT if a length-$n$ FFT is available over
GF($2^m$), which requires $n \mid 2^m - 1$. If $n \nmid 2^m - 1$, one
option is to pad the polynomials to length $n'$ ($n' > n$) with
$n' \mid 2^m - 1$. Compared with fast polynomial multiplication
based on multiplicative FFT, Cantor's approach uses additive
FFT and does not require $n \mid 2^m - 1$, so it is more efficient
than FFT multiplication with padding for most degrees.
For $n = 2^m - 1$, their complexities are similar. Although
asymptotically worse than Schönhage's algorithm [12], which
has $O(n \log n \log\log n)$ complexity, Cantor's approach has
small implicit constants and hence it is more suitable for
practical implementation of RS codes [6], [11]. Gao claimed
an improvement on Cantor's approach in [6], but we do not
pursue this due to lack of details.

A tighter bound on the complexity of Cantor's approach
is given in Theorem 1. Here we make the same assumption
as in [11] that the auxiliary polynomials $s_i$ and the values
$s_i(\beta_j)$ are precomputed. The complexity of pre-computation
was given in [11].

Theorem 1: By Cantor's approach, two polynomials $a, b \in$
GF($2^m$)$[x]$ whose product has degree less than $h$ ($1 \le h \le
2^m$) can be multiplied using less than
$\frac{3}{2} h \log^2 h + \frac{7}{2} h \log h - 2h + \log h + 2$ multiplications,
$\frac{3}{2} h \log^2 h + \frac{21}{2} h \log h - 13h + \log h + 15$ additions,
and $2h$ inversions over GF($2^m$).
TABLE IV
Decoder Architectures Based on Syndromeless Decoding (CPD is T_mult + T_add + T_mux)

Unit              Algorithm  Multipliers      Adders          Inverters  Registers           Muxes               Latency             Throughput^-1
Interpolation     1 and 2    n                n               0          5n                  n                   4n                  3n
Partial GCD       Alg. 1     n + 1            n + 1           0          5n + 5              7n + 7              12t                 12t
Partial GCD       Alg. 2     2t + 1           2t + 1          0          10t + 5             14t + 7             12t                 12t
Message Recovery  Alg. 1     2k + t + 3       k + t + 1       1          6k + 5t + 8         7k + 7t + 7         6k + 4              6k
Message Recovery  Alg. 2     3n + 2           3n + 1          1          7n + 7              7n + 7              6n + t - 2          6n
Total             Alg. 1     2n + 2k + t + 4  2n + k + t + 2  1          10n + 6k + 5t + 13  8n + 7k + 7t + 14   4n + 6k + 12t + 4   6k
Total             Alg. 2     4n + 2t + 3      4n + 2t + 2     1          12n + 10t + 12      8n + 14t + 14       10n + 13t - 2       6n



Proof: There exists $0 < p \le m$ satisfying $2^{p-1} < h \le 2^p$. Since both the MPE and MPI algorithms are recursive, we denote the numbers of additions of the MPE and MPI algorithms for input $i$ ($0 \le i \le p$) as $S_E(i)$ and $S_I(i)$, respectively. Clearly $S_E(0) = S_I(0) = 0$. Following the approach in [11], it can be shown that for $1 \le i \le p$,

$S_E(i) \le i(i+3)2^{i-2} + (p-3)(2^i - 1) + i$, (1)
$S_I(i) \le i(i+5)2^{i-2} + (p-3)(2^i - 1) + i$. (2)

Let $M_E(h)$ and $A_E(h)$ denote the numbers of multiplications and additions, respectively, that the MPE algorithm requires for polynomials of degree less than $h$. When $i = p$ in the MPE algorithm, $f(x)$ has degree less than $h \le 2^p$, while $s_{p-1}$ is of degree $2^{p-1}$ and has at most $p$ nonzero coefficients. Thus $g(x)$ has degree less than $h - 2^{p-1}$. Therefore the numbers of multiplications and additions for the polynomial division in [11, Step 2 of Algorithm 3.1] are both $p(h - 2^{p-1})$, while $r_1(x) = r_0(x) + s_{i-1}(\beta_i) g(x)$ needs at most $h - 2^{p-1}$ multiplications and the same number of additions. Substituting the bound on $M_E(2^{p-1})$ in [11], we obtain $M_E(h) \le 2M_E(2^{p-1}) + p(h - 2^{p-1}) + h - 2^{p-1}$, and thus $M_E(h)$ is at most $\frac{1}{4}p^2 2^p - \frac{1}{4}p 2^p - 2^p + (p+1)h$. Similarly, substituting the bound on $S_E(p-1)$ in Eq. (1), we obtain $A_E(h) \le 2S_E(p-1) + p(h - 2^{p-1}) + h - 2^{p-1}$, and hence $A_E(h)$ is at most $\frac{1}{4}p^2 2^p + \frac{3}{4}p 2^p - 4 \cdot 2^p + (p+1)h + 4$.

Let $M_I(h)$ and $A_I(h)$ denote the numbers of multiplications and additions, respectively, that the MPI algorithm requires when the interpolated polynomial has degree less than $h$. When $i = p$ in the MPI algorithm, $f(x)$ has degree less than $h \le 2^p$. It implies that $r_0(x) + r_1(x)$ has degree less than $h - 2^{p-1}$. Thus it requires at most $h - 2^{p-1}$ additions to obtain $r_0(x) + r_1(x)$ and $h - 2^{p-1}$ multiplications for $s_{i-1}(\beta_i)^{-1}(r_0(x) + r_1(x))$. The numbers of multiplications and additions for the polynomial multiplication in [11, Step 3 of Algorithm 3.2] to obtain $f(x)$ are both $p(h - 2^{p-1})$. Adding $r_0(x)$ also needs $2^{p-1}$ additions. Substituting the bound on $M_I(2^{p-1})$ in [11], we have $M_I(h) \le 2M_I(2^{p-1}) + p(h - 2^{p-1}) + h - 2^{p-1}$, and hence $M_I(h)$ is at most $\frac{1}{4}p^2 2^p - \frac{1}{4}p 2^p - 2^p + (p+1)h$. Similarly, substituting the bound on $S_I(p-1)$ in Eq. (2), we have $A_I(h) \le 2S_I(p-1) + p(h - 2^{p-1}) + h + 1$, and hence $A_I(h)$ is at most $\frac{1}{4}p^2 2^p + \frac{5}{4}p 2^p - 4 \cdot 2^p + (p+1)h + 5$. The interpolation step also needs $2^p$ inversions.

Let $M(h_1, h_2)$ be the complexity of multiplication of two polynomials of degrees less than $h_1$ and $h_2$. Using Cantor's approach, $M(h_1, h_2)$ includes $M_E(h_1) + M_E(h_2) + M_I(h) + 2^p$ multiplications, $A_E(h_1) + A_E(h_2) + A_I(h)$ additions, and $2^p$ inversions, when $h = h_1 + h_2 - 1$. Finally, we replace $2^p$ by $2h$ as in [11]. ∎
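As a sanity check on the closed forms in the proof, the following Python sketch substitutes the power-of-two bounds into the three recurrences and compares the results with the stated closed forms, using exact rational arithmetic. The function names are illustrative only. The bound $i(i+3)2^{i-2}$ on the multiplications for $2^i$ points is an assumption inferred from the recurrence (each level contributing $(i+1)2^{i-1}$ multiplications), not a quotation from [11].

```python
from fractions import Fraction as F

def s_e(i, p):
    """Right-hand side of Eq. (1): bound on MPE additions for input i."""
    return F(i * (i + 3)) * F(2) ** (i - 2) + (p - 3) * (2 ** i - 1) + i

def s_i(i, p):
    """Right-hand side of Eq. (2): bound on MPI additions for input i."""
    return F(i * (i + 5)) * F(2) ** (i - 2) + (p - 3) * (2 ** i - 1) + i

def m_pow2(i):
    """Assumed bound on multiplications for 2^i points: i(i+3)2^(i-2)."""
    return F(i * (i + 3)) * F(2) ** (i - 2)

def closed_forms(h, p):
    """Closed-form bounds on M_E(h) (same form as M_I(h)), A_E(h), A_I(h)."""
    two_p = 2 ** p
    m = F(1, 4) * p * p * two_p - F(1, 4) * p * two_p - two_p + (p + 1) * h
    a_e = F(1, 4) * p * p * two_p + F(3, 4) * p * two_p - 4 * two_p + (p + 1) * h + 4
    a_i = F(1, 4) * p * p * two_p + F(5, 4) * p * two_p - 4 * two_p + (p + 1) * h + 5
    return m, a_e, a_i

def recurrences(h, p):
    """The three top-level recurrences with the level-(p-1) bounds substituted."""
    half = 2 ** (p - 1)
    m = 2 * m_pow2(p - 1) + p * (h - half) + h - half
    a_e = 2 * s_e(p - 1, p) + p * (h - half) + h - half
    a_i = 2 * s_i(p - 1, p) + p * (h - half) + h + 1
    return m, a_e, a_i

def check(max_p=10):
    """Compare both sides for every h with 2^(p-1) < h <= 2^p, p = 1..max_p."""
    for p in range(1, max_p + 1):
        for h in range(2 ** (p - 1) + 1, 2 ** p + 1):
            assert closed_forms(h, p) == recurrences(h, p)
    return True
```

Calling `check()` walks over all admissible pairs $(h, p)$ up to $p = 10$; agreement there only confirms the algebra of the substitution, not the underlying bounds themselves.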



Compared with the results in [11], our results have the same highest degree term but smaller terms for lower degrees.

By Theorem 1, we can easily compute $M(h_1) = M(h_1, h_1)$. A by-product of the above proof is the bounds for the MPE and MPI algorithms. We also observe some properties of the complexity of fast polynomial multiplication that hold not only for Cantor's approach but also for other approaches. These properties will be used in our complexity analysis next. Since all fast polynomial multiplication algorithms have higher-than-linear complexities, $2M(h) \le M(2h)$. Also note that $M(h+1)$ is no more than $M(h)$ plus $2h$ multiplications and $2h$ additions [12, Exercise 8.34]. Since the complexity bound is determined only by the degree of the product polynomial, we assume $M(h_1, h_2) \le M(\lceil (h_1 + h_2)/2 \rceil)$. We note that the complexities of Schönhage's algorithm as well as Schönhage and Strassen's algorithm, both based on multiplicative FFT, are also determined by the degree of the product polynomial [12].

B. Polynomial Division 

Similar to [12, Exercise 9.6], in characteristic-2 fields, the complexity of Newton iteration is at most

$\sum_{0 \le j \le r-1} \left( M(\lceil (d_0+1)2^{-j} \rceil) + M(\lceil (d_0+1)2^{-j-1} \rceil) \right)$,

where $r = \lceil \log(d_0+1) \rceil$. Since $\lceil (d_0+1)2^{-j} \rceil \le \lfloor (d_0+1)2^{-j} \rfloor + 1$ and $M(h+1)$ is no more than $M(h)$ plus $2h$ multiplications and $2h$ additions [12, Exercise 8.34], it requires at most $\sum_{0 \le j \le r-1} \left( M(\lfloor (d_0+1)2^{-j} \rfloor) + M(\lfloor (d_0+1)2^{-j-1} \rfloor) \right)$, plus $\sum_{0 \le j \le r-1} \left( 2\lfloor (d_0+1)2^{-j} \rfloor + 2\lfloor (d_0+1)2^{-j-1} \rfloor \right)$ multiplications and the same number of additions. Since $2M(h) \le M(2h)$, Newton iteration costs at most $\sum_{0 \le j \le r-1} \frac{3}{2} M(\lfloor (d_0+1)2^{-j} \rfloor) \le 3M(d_0+1)$, $6(d_0+1)$ multiplications, and $6(d_0+1)$ additions. The second step to compute the quotient needs $M(d_0+1)$ and the last step to compute the remainder needs $M(d_1+1, d_0+1)$ and $d_1+1$ additions. By $M(d_1+1, d_0+1) \le M(\lceil (d_0+d_1)/2 \rceil + 1)$, the total cost is at most $4M(d_0) + M(\lceil (d_0+d_1)/2 \rceil)$, $15d_0 + d_1 + 7$ multiplications, and $11d_0 + 2d_1 + 8$ additions. Note that this bound does not require $d_1 \ge d_0$ as in [12].
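The two geometric-sum steps in the Newton iteration bound can be checked numerically. The sketch below evaluates the floor-based sums for a given $d_0$ and compares them with $3M(d_0+1)$ and $6(d_0+1)$. The cost model `schoolbook` ($M(h) = h^2$) is only a stand-in chosen because it satisfies $2M(h) \le M(2h)$; it is not the Cantor-based bound of Theorem 1, and all function names are hypothetical.

```python
def num_iterations(d0):
    """r = ceil(log2(d0 + 1)); for d0 >= 0 this equals d0.bit_length()."""
    return d0.bit_length()

def poly_mult_terms(d0, M):
    """Sum over j of M(floor((d0+1)/2^j)) + M(floor((d0+1)/2^(j+1)))."""
    total = 0
    for j in range(num_iterations(d0)):
        total += M((d0 + 1) >> j) + M((d0 + 1) >> (j + 1))
    return total

def extra_field_ops(d0):
    """Sum over j of 2*floor((d0+1)/2^j) + 2*floor((d0+1)/2^(j+1))."""
    return sum(
        2 * ((d0 + 1) >> j) + 2 * ((d0 + 1) >> (j + 1))
        for j in range(num_iterations(d0))
    )

def schoolbook(h):
    """Toy cost model with 2*M(h) <= M(2h); an assumption for illustration."""
    return h * h

def bounds_hold(d0, M=schoolbook):
    """True if both sums stay within 3*M(d0+1) and 6*(d0+1)."""
    return (poly_mult_terms(d0, M) <= 3 * M(d0 + 1)
            and extra_field_ops(d0) <= 6 * (d0 + 1))
```

Any other monotone cost function with $2M(h) \le M(2h)$ can be passed as `M` to repeat the check under a different model.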

C. Partial GCD 

The partial GCD step can be implemented in three ways: the ST, the classical EEA with fast polynomial multiplication and Newton iteration, and the FEEA with fast polynomial multiplication and Newton iteration. The ST is essentially the classical EEA. The complexity of the classical EEA is asymptotically worse than that of the FEEA. Since the FEEA is more suitable for long codes, we will use the FEEA in our complexity analysis of fast implementations.

In order to derive a tighter bound on the complexity of the FEEA, we first present a modified FEEA in Algorithm 3. Let $\eta(h) = \max\{j : \sum_{i=1}^{j} \deg q_i \le h\}$, which is the number of steps of the EEA satisfying $\deg r_0 - \deg r_{\eta(h)} \le h < \deg r_0 - \deg r_{\eta(h)+1}$. For $f(x) = f_n x^n + \cdots + f_1 x + f_0$ with $f_n \ne 0$, the truncated polynomial $f(x) \upharpoonright h = f_n x^h + \cdots + f_{n-h+1} x + f_{n-h}$, where $f_i = 0$ for $i < 0$. Note that $f(x) \upharpoonright h = 0$ if $h < 0$.
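To make the two definitions concrete, here is a minimal Python sketch over GF(2)[x], where a polynomial is stored as an integer bitmask (bit $i$ is the coefficient of $x^i$). This representation and the helper names are assumptions made for illustration; over GF(2) every nonzero polynomial is monic, so no normalization is needed. The function `eta` runs the classical EEA and counts quotients until their degrees would exceed $h$, and `truncate` implements the truncation operator.

```python
def deg(a):
    """Degree of a GF(2)[x] polynomial stored as an int bitmask; deg(0) = -1."""
    return a.bit_length() - 1

def divmod_gf2(a, b):
    """Quotient and remainder of a divided by b in GF(2)[x], with b != 0."""
    q = 0
    while deg(a) >= deg(b):
        shift = deg(a) - deg(b)
        q ^= 1 << shift      # add x^shift to the quotient
        a ^= b << shift      # subtract (XOR) the shifted divisor
    return q, a

def eta(r0, r1, h):
    """max{ j : deg q_1 + ... + deg q_j <= h } for the EEA on (r0, r1)."""
    steps, used = 0, 0
    while r1 != 0:
        q, r = divmod_gf2(r0, r1)
        if used + deg(q) > h:
            break
        used += deg(q)
        steps += 1
        r0, r1 = r1, r
    return steps

def truncate(f, h):
    """Top h+1 coefficients of f, padded with zeros if h > deg f; 0 if h < 0."""
    if h < 0 or f == 0:
        return 0
    n = deg(f)
    return f >> (n - h) if h <= n else f << (h - n)
```

For example, with $r_0 = x^4 + x + 1$ (`0b10011`) and $r_1 = x^3 + 1$ (`0b1001`) the quotient degrees are 1 and 3, so `eta` returns 1 for $h = 1, 2, 3$ and 2 for $h = 4$.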
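Algorithm 3: Modified Fast Extended Euclidean Algorithm

Input: two monic polynomials $r_0$ and $r_1$, with $\deg r_0 = n_0 > n_1 = \deg r_1$, as well as an integer $h$ ($0 \le h \le n_0$)
Output: $l = \eta(h)$, $\rho_{l+1}$, $R_l$, $r_l$, and $\tilde{r}_{l+1}$

3.1 If $r_1 = 0$ or $h < n_0 - n_1$, then return $0$, $1$, $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$, $r_0$, and $r_1$.

3.2 $h_1 = \lfloor h/2 \rfloor$, $r_0^* = r_0 \upharpoonright 2h_1$, $r_1^* = r_1 \upharpoonright (2h_1 - (n_0 - n_1))$.

3.3 $(j - 1, \rho_j^*, R_{j-1}^*, r_{j-1}^*, \tilde{r}_j^*) = \mathrm{FEEA}(r_0^*, r_1^*, h_1)$.

3.4 $\begin{bmatrix} r_{j-1} \\ \tilde{r}_j \end{bmatrix} = R_{j-1}^* \begin{bmatrix} r_0 - r_0^* x^{n_0 - 2h_1} \\ r_1 - r_1^* x^{n_0 - 2h_1} \end{bmatrix} + \begin{bmatrix} r_{j-1}^* x^{n_0 - 2h_1} \\ \tilde{r}_j^* x^{n_0 - 2h_1} \end{bmatrix}$, $R_{j-1} = \begin{bmatrix} 1 & 0 \\ 0 & 1/\mathrm{lc}(\tilde{r}_j) \end{bmatrix} R_{j-1}^*$, $\rho_j = \rho_j^* \, \mathrm{lc}(\tilde{r}_j)$, $r_j = \tilde{r}_j / \mathrm{lc}(\tilde{r}_j)$, $n_j = \deg r_j$.

3.5 If $r_j = 0$ or $h < n_0 - n_j$, then return $j - 1$, $\rho_j$, $R_{j-1}$, $r_{j-1}$, and $\tilde{r}_j$.

3.6 Perform polynomial division with remainder as $r_{j-1} = q_j r_j + \tilde{r}_{j+1}$, $\rho_{j+1} = \mathrm{lc}(\tilde{r}_{j+1})$, $r_{j+1} = \tilde{r}_{j+1} / \rho_{j+1}$, $n_{j+1} = \deg r_{j+1}$, $R_j = \begin{bmatrix} 0 & 1 \\ 1/\rho_{j+1} & -q_j/\rho_{j+1} \end{bmatrix} R_{j-1}$.

3.7 $h_2 = h - (n_0 - n_j)$, $r_j^* = r_j \upharpoonright 2h_2$, $r_{j+1}^* = r_{j+1} \upharpoonright (2h_2 - (n_j - n_{j+1}))$.

3.8 $(l - j, \rho_{l+1}^*, S^*, r_l^*, \tilde{r}_{l+1}^*) = \mathrm{FEEA}(r_j^*, r_{j+1}^*, h_2)$.

3.9 $\begin{bmatrix} r_l \\ \tilde{r}_{l+1} \end{bmatrix} = S^* \begin{bmatrix} r_j - r_j^* x^{n_j - 2h_2} \\ r_{j+1} - r_{j+1}^* x^{n_j - 2h_2} \end{bmatrix} + \begin{bmatrix} r_l^* x^{n_j - 2h_2} \\ \tilde{r}_{l+1}^* x^{n_j - 2h_2} \end{bmatrix}$, $S = \begin{bmatrix} 1 & 0 \\ 0 & 1/\mathrm{lc}(\tilde{r}_{l+1}) \end{bmatrix} S^*$, $\rho_{l+1} = \rho_{l+1}^* \, \mathrm{lc}(\tilde{r}_{l+1})$.

3.10 Return $l$, $\rho_{l+1}$, $S R_j$, $r_l$, $\tilde{r}_{l+1}$.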

It is easy to verify that Algorithm 3 is equivalent to the FEEA in [12], [17]. The difference between Algorithm 3 and the FEEA in [12], [17] lies in Steps 3.4, 3.5, 3.8, and 3.10: in Steps 3.5 and 3.10, two additional polynomials are returned, and they are used in the updates of Steps 3.4 and 3.8 to reduce complexity. The modification in Step 3.4 was suggested in [14] and the modification in Step 3.9 follows the same idea.

In [12], [14], the complexity bounds of the FEEA are established assuming $n_0 \le 2h$. Thus we first establish a bound of the FEEA for the case $n_0 \le 2h$ below in Theorem 2, using the bounds we develop in Sections IV-A and IV-B. The proof is similar to those in [12], [14] and hence omitted; interested readers should have no difficulty filling in the details.

Theorem 2: Let $T(n_0, h)$ denote the complexity of the FEEA. When $n_0 \le 2h$, $T(n_0, h)$ is at most $17M(h)\log h$ plus $(48h + 2)\log h$ multiplications, $(51h + 2)\log h$ additions, and $3h$ inversions. Furthermore, if the degree sequence is normal, $T(2h, h)$ is at most $10M(h)\log h$, (^h + 6) log h multiplications, and (^h + 3) log h additions.

Compared with the complexity bounds in [12], [14], our bound not only is tighter, but also specifies all terms of the complexity and avoids the big O notation. The saving over [14] is due to lower complexities of Steps 3.6, 3.9, and 3.10 as explained above. The saving for the normal case over [12] is due to lower complexity of Step 3.9.

Applying the FEEA to $g_0(x)$ and $g_1(x)$ to find $v(x)$ and $g(x)$ in Algorithm 1, we have $n_0 = n$ and $h \le t$ since $\deg v(x) \le t$. For RS codes, we always have $n > 2t$. Thus, the condition $n_0 \le 2h$ for the complexity bound in [12], [14] does not hold. It was pointed out in [6], [12] that $s_0(x)$ and $s_1(x)$ as defined in Algorithm 2 can be used instead of $g_0(x)$ and $g_1(x)$, which is the difference between Algorithms 1 and 2. Although such a transform allows us to use the results in [12], [14], it introduces extra cost for message recovery [6]. To compare the complexities of Algorithms 1 and 2, we establish a more general bound in Theorem 3.

Theorem 3: The complexity of the FEEA is no more than $34M(\lfloor h/2 \rfloor)\log\lfloor h/2 \rfloor$ + M(L^J) + 4M([2|)-fl) + 2M(L^J) + $4M(h)$ + 2M(L|/iJ) + $4M(\lfloor h/2 \rfloor)$, $(48h + 4)\log\lfloor h/2 \rfloor + 9n_0 + 22h$ multiplications, $(51h + 4)\log\lfloor h/2 \rfloor + 11n_0 + 17h + 2$ additions, and $3h$ inversions.

The proof is also omitted for brevity. The main difference between this case and Theorem 2 lies in the top-level call of the FEEA. The total complexity is obtained by adding $2T(h, \lfloor h/2 \rfloor)$ and the top-level cost.
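The recursive part of this bound can be read off from Theorem 2. As a rough sketch, assuming that each of the two recursive calls satisfies the hypothesis of Theorem 2 with $h$ replaced by $\lfloor h/2 \rfloor$, and leaving the top-level cost aside, doubling the Theorem 2 bound gives:

```latex
\begin{align*}
2 \cdot 17\, M(\lfloor h/2 \rfloor) \log \lfloor h/2 \rfloor
  &= 34\, M(\lfloor h/2 \rfloor) \log \lfloor h/2 \rfloor, \\
2\,(48 \lfloor h/2 \rfloor + 2) \log \lfloor h/2 \rfloor
  &\le (48h + 4) \log \lfloor h/2 \rfloor \quad \text{multiplications}, \\
2\,(51 \lfloor h/2 \rfloor + 2) \log \lfloor h/2 \rfloor
  &\le (51h + 4) \log \lfloor h/2 \rfloor \quad \text{additions}, \\
2 \cdot 3 \lfloor h/2 \rfloor
  &\le 3h \quad \text{inversions}.
\end{align*}
```

These expressions account for the logarithmic terms and the inversion count in Theorem 3; the remaining polynomial-multiplication terms and the terms linear in $n_0$ and $h$ would then come from the top-level call.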

It can be verified that, when $n_0 \le 2h$, Theorem 3 presents a tighter bound than Theorem 2 since saving on the top level is accounted for. Note that the complexity bounds in Theorems 2 and 3 assume that the FEEA solves $s_{l+1} r_0 + t_{l+1} r_1 = r_{l+1}$ for both $t_{l+1}$ and $s_{l+1}$. If $s_{l+1}$ is not necessary, the complexity bounds in Theorems 2 and 3 are further reduced by $2M(\lfloor h/2 \rfloor)$, $3h + 1$ multiplications, and $4h + 1$ additions.

D. Complexity Comparison 

Using the results in Sections IV-A, IV-B, and IV-C, we first analyze and then compare the complexities of Algorithms 1 and 2 as well as syndrome-based decoding under fast implementations.

In Steps 1.1 and 2.1, $g_1(x)$ can be obtained by an inverse FFT when $n \mid 2^m - 1$ or by the MPI algorithm. In the latter case, the complexity is given in Section IV-A. By Theorem 3, the complexity of Step 1.2 is $T(n, t)$ minus the complexity to compute $s_{l+1}$. The complexity of Step 2.2 is $T(2t, t)$. The complexity of Step 1.3 is given by the bound in Section IV-B. Similarly, the complexity of Step 2.3 is readily obtained by using the bounds of polynomial division and multiplication.

All the steps of syndrome-based decoding can be implemented using fast algorithms. Both syndrome computation and the Chien search can be done by $n$-point evaluations. Forney's formula can be done by two $t$-point evaluations plus $t$ inversions and $t$ multiplications. To use the MPE algorithm, we choose to evaluate on all $n$ points. By Theorem 3, the complexity of the key equation solver is $T(2t, t)$ minus the complexity to compute $s_{l+1}$.

Note that to simplify the expressions, the complexities are expressed in terms of three kinds of operations: polynomial multiplications, field multiplications, and field additions. Of course, with our bounds on the complexity of polynomial multiplication in Theorem 1, the complexities of the decoding algorithms can be expressed in terms of field multiplications and additions.

Given the code parameters, the comparison among these algorithms is quite straightforward with the above expressions. As in Section III-B, we attempt to compare the complexities using only $R$. Such a comparison is of course not accurate, but it sheds light on the comparative complexity of these decoding algorithms without getting entangled in the details. To this end, we make four assumptions. First, we assume the complexity bounds on the decoding algorithms as approximate decoding complexities. Second, we use the complexity bound in Theorem 1 as approximate polynomial multiplication complexities. Third, since the numbers of multiplications and additions are of the same degree, we only compare the numbers of multiplications. Fourth, we focus on the difference of the second highest degree terms since the highest degree terms are the same for all three algorithms. This is because the partial GCD steps of Algorithms 1 and 2 as well as the key equation solver in syndrome-based decoding differ only in the top level of the recursion of the FEEA. Hence Algorithms 1 and 2 as well as the key equation solver in syndrome-based decoding have the same highest degree term.

We first compare the complexities of Algorithms 1 and 2. Using Theorem 1, the difference between the second highest degree terms is a positive multiple of $(25R - 13)n\log^2 n$, so Algorithm 1 is less efficient than Algorithm 2 when $R > 0.52$. Similarly, the complexity difference between syndrome-based decoding and Algorithm 1 is a positive multiple of $(1 - 31R)n\log^2 n$. Thus syndrome-based decoding is more efficient than Algorithm 1 when $R > 0.032$. Comparing syndrome-based decoding and Algorithm 2, the complexity difference is roughly a negative multiple of $(2 + R)n\log^2 n$. Hence syndrome-based decoding is more efficient than Algorithm 2 regardless of the rate.
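The rate thresholds quoted above follow directly from the sign of the rate-dependent factors. The short Python sketch below evaluates those signs for a given rate; it uses only the factors $(25R - 13)$, $(1 - 31R)$, and $-(2 + R)$ from the text, ignores the constant multipliers and the common $n\log^2 n$ factor, and uses hypothetical function and key names.

```python
from fractions import Fraction

def rate_thresholds():
    """Rates at which the two rate-dependent factors change sign."""
    alg1_vs_alg2 = Fraction(13, 25)   # root of 25R - 13
    synd_vs_alg1 = Fraction(1, 31)    # root of 1 - 31R
    return alg1_vs_alg2, synd_vs_alg1

def comparison(rate):
    """Outcome of the second-highest-degree-term comparison at a given rate."""
    return {
        "alg1_less_efficient_than_alg2": 25 * rate - 13 > 0,
        "syndrome_beats_alg1": 1 - 31 * rate < 0,
        "syndrome_beats_alg2": -(2 + rate) < 0,   # holds for every rate >= 0
    }
```

For instance, `comparison(Fraction(223, 255))` evaluates the three comparisons at the rate of the (255, 223) code considered in the case study.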

We remark that the conclusion of the above comparison is similar to those obtained in Section III-B except that the thresholds are different. Based on fast implementations, Algorithm 1 is more efficient than Algorithm 2 for low rate codes, and syndrome-based decoding is more efficient than Algorithms 1 and 2 in virtually all cases.

V. Case Study and Discussions 
A. Case Study 

We examine the complexities of Algorithms 1 and 2 as well as syndrome-based decoding for the (255, 223) CCSDS RS code [25] and a (511, 447) RS code, which have roughly the same rate $R = 0.87$. Again, both direct and fast implementations are investigated. Due to the moderate lengths, in some cases direct implementation leads to lower complexity, and hence in such cases, the complexity of direct implementation is used for both.

Tables V and VI list the total decoding complexity of Algorithms 1 and 2 as well as syndrome-based decoding, respectively. In the fast implementations, cyclotomic FFT [16] is used for interpolation, syndrome computation, and the Chien search. The classical EEA with fast polynomial multiplication and division is used in fast implementations since it is more efficient than the FEEA for these lengths. We assume normal degree sequence, which represents the worst case scenario [12]. The message recovery steps use long division in fast implementation since it is more efficient than Newton iteration for these lengths. We use Horner's rule for Forney's formula in both direct and fast implementations.

We note that for each decoding step, Tables V and VI not only provide the numbers of finite field multiplications, additions, and inversions, but also list the overall complexities to facilitate comparisons. The overall complexities are computed based on the assumptions that multiplication and inversion are of equal complexity, and that, as in [15], one multiplication is equivalent to $2m$ additions. The latter assumption is justified by both hardware and software implementations of finite field operations. In hardware implementation, a multiplier over GF($2^m$) generated by trinomials requires $m^2 - 1$ XOR and $m^2$ AND gates [26], while an adder requires $m$ XOR gates. Assuming that XOR and AND gates have the same complexity, the complexity of a multiplier is $2m$ times that of an adder over GF($2^m$). In software implementation, the complexity can be measured by the number of word-level operations [27]. Using the shift-and-add method as in [27], a multiplication requires $m - 1$ shift and $m$ XOR word-level operations, while an addition needs only one XOR word-level operation. Hence in software implementations the complexity of a multiplication over GF($2^m$) is also roughly $2m$ times that of an addition. Thus the total complexity of each decoding step in Tables V and VI is obtained by $N = 2m(N_{mult} + N_{inv}) + N_{add}$, which is in terms of field additions.
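The weighting rule can be written as a one-line function. The Python sketch below is illustrative only: the function name is hypothetical, and the field sizes $m = 8$ for length 255 and $m = 9$ for length 511 are assumptions based on $n = 2^m - 1$. The sample inputs are operation counts taken from Tables V and VI.

```python
def overall_complexity(m, n_mult, n_add, n_inv=0):
    """Overall cost in field additions: N = 2m(N_mult + N_inv) + N_add."""
    return 2 * m * (n_mult + n_inv) + n_add

# Operation counts from Table V for the (255, 223) code, assuming m = 8.
interpolation_direct = overall_complexity(8, 64770, 64770)
partial_gcd_fast_alg1 = overall_complexity(8, 8224, 8176, n_inv=16)

# Operation counts from Table VI for the (511, 447) code, assuming m = 9.
forney_511 = overall_complexity(9, 2048, 2016, n_inv=32)
```

With these inputs the function reproduces the corresponding "Overall" entries of the tables (1101090, 140016, and 39456).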

Comparisons between direct and fast implementations for each algorithm show that fast implementations considerably reduce the complexities of both syndromeless and syndrome-based decoding, as shown in Tables V and VI. The comparison between these tables shows that for these two high-rate codes, both direct and fast implementations of syndromeless decoding are not as efficient as their counterparts of syndrome-based decoding. This observation is consistent with our conclusions in Sections III-B and IV-D.
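To put numbers on this comparison, the following Python sketch computes the ratio of each syndromeless total to the syndrome-based total, using the "Overall" totals from Tables V and VI. The dictionary layout and function name are illustrative choices, not part of the paper.

```python
# Overall totals (in field additions) copied from Tables V and VI.
TOTALS = {
    (255, 223): {
        "direct": {"alg1": 2297056, "alg2": 2262594, "syndrome": 248272},
        "fast": {"alg1": 220532, "alg2": 178373, "syndrome": 50320},
    },
    (511, 447): {
        "direct": {"alg1": 10317206, "alg2": 10142090, "syndrome": 1117330},
        "fast": {"alg1": 945804, "alg2": 744607, "syndrome": 185030},
    },
}

def ratios_to_syndrome(code, impl):
    """Overall cost of Algorithms 1 and 2 relative to syndrome-based decoding."""
    row = TOTALS[code][impl]
    return {name: row[name] / row["syndrome"] for name in ("alg1", "alg2")}

def fast_speedup(code, name):
    """Reduction factor of the fast implementation over the direct one."""
    return TOTALS[code]["direct"][name] / TOTALS[code]["fast"][name]

if __name__ == "__main__":
    for code in TOTALS:
        for impl in ("direct", "fast"):
            print(code, impl, ratios_to_syndrome(code, impl))
```

Every ratio returned by `ratios_to_syndrome` exceeds 3 for these two codes, which is one way to quantify the statement that syndromeless decoding is less efficient at high rates.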

For these two codes, hardware costs and throughput of decoder architectures based on direct implementations of syndrome-based and syndromeless decoding can be easily obtained by substituting the parameters in Tables III and IV; thus for these two codes, the conclusions in Section III-C apply.

B. Errors-and-Erasures Decoding 

The complexity analysis of RS decoding in Sections III and IV has assumed errors-only decoding. We extend our complexity analysis to errors-and-erasures decoding below.

Syndrome-based errors-and-erasures decoding has been well studied, and we adopt the approach in [18]. In this approach, first the erasure locator polynomial and the modified syndrome polynomial are computed. After the error locator polynomial is found by the key equation solver, the errata locator polynomial is computed and the error and erasure values are computed by Forney's formula. This approach is used in both direct and fast implementations.

TABLE V

Complexity of Syndromeless Decoding

Direct Implementation

(n, k) | Step | Alg. 1 Mult. | Alg. 1 Add. | Alg. 1 Inv. | Alg. 1 Overall | Alg. 2 Mult. | Alg. 2 Add. | Alg. 2 Inv. | Alg. 2 Overall
(255, 223) | Interpolation | 64770 | 64770 | 0 | 1101090 | 64770 | 64770 | 0 | 1101090
(255, 223) | Partial GCD | 16448 | 8192 | 0 | 271360 | 2176 | 1056 | 0 | 35872
(255, 223) | Msg Recovery | 57536 | 4014 | 1 | 924606 | 69841 | 8160 | 1 | 1125632
(255, 223) | Total | 138754 | 76976 | 1 | 2297056 | 136787 | 73986 | 1 | 2262594
(511, 447) | Interpolation | 260610 | 260610 | 0 | 4951590 | 260610 | 260610 | 0 | 4951590
(511, 447) | Partial GCD | 65664 | 32768 | 0 | 1214720 | 8448 | 4160 | 0 | 156224
(511, 447) | Msg Recovery | 229760 | 15198 | 1 | 4150896 | 277921 | 31680 | 1 | 5034276
(511, 447) | Total | 556034 | 308576 | 1 | 10317206 | 546979 | 296450 | 1 | 10142090

Fast Implementation

(n, k) | Step | Alg. 1 Mult. | Alg. 1 Add. | Alg. 1 Inv. | Alg. 1 Overall | Alg. 2 Mult. | Alg. 2 Add. | Alg. 2 Inv. | Alg. 2 Overall
(255, 223) | Interpolation | 586 | 6900 | 0 | 16276 | 586 | 6900 | 0 | 16276
(255, 223) | Partial GCD | 8224 | 8176 | 16 | 140016 | 1392 | 1328 | 16 | 23856
(255, 223) | Msg Recovery | 3791 | 3568 | 1 | 64240 | 8160 | 7665 | 1 | 138241
(255, 223) | Total | 12601 | 18644 | 17 | 220532 | 10138 | 15893 | 17 | 178373
(511, 447) | Interpolation | 1014 | 23424 | 0 | 41676 | 1014 | 23424 | 0 | 41676
(511, 447) | Partial GCD | 32832 | 32736 | 32 | 624288 | 5344 | 5216 | 32 | 101984
(511, 447) | Msg Recovery | 14751 | 14304 | 1 | 279840 | 31680 | 30689 | 1 | 600947
(511, 447) | Total | 48597 | 70464 | 33 | 945804 | 38038 | 59329 | 33 | 744607



TABLE VI

Complexity of Syndrome-Based Decoding

(n, k) | Step | Direct Mult. | Direct Add. | Direct Inv. | Direct Overall | Fast Mult. | Fast Add. | Fast Inv. | Fast Overall
(255, 223) | Syndrome Computation | 8128 | 8128 | 0 | 138176 | 149 | 4012 | 0 | 6396
(255, 223) | Key Equation Solver | 2176 | 1056 | 0 | 35872 | 1088 | 1040 | 16 | 18704
(255, 223) | Chien Search | 3825 | 4080 | 0 | 65280 | 586 | 6900 | 0 | 16276
(255, 223) | Forney's Formula | 512 | 496 | 16 | 8944 | 512 | 496 | 16 | 8944
(255, 223) | Total | 14641 | 13760 | 16 | 248272 | 2335 | 12448 | 32 | 50320
(511, 447) | Syndrome Computation | 32640 | 32640 | 0 | 620160 | 345 | 16952 | 0 | 23162
(511, 447) | Key Equation Solver | 8448 | 4160 | 0 | 156224 | 4224 | 4128 | 32 | 80736
(511, 447) | Chien Search | 15841 | 16352 | 0 | 301490 | 1014 | 23424 | 0 | 41676
(511, 447) | Forney's Formula | 2048 | 2016 | 32 | 39456 | 2048 | 2016 | 32 | 39456
(511, 447) | Total | 58977 | 55168 | 32 | 1117330 | 7631 | 46520 | 64 | 185030



Syndromeless errors-and-erasures decoding can be carried out in two ways. Let us denote the number of erasures as $\nu$ ($0 \le \nu \le 2t$); up to $f = \lfloor (2t - \nu)/2 \rfloor$ errors can be corrected given $\nu$ erasures. As pointed out in [5], [6], the first approach is to ignore the $\nu$ erased coordinates, thereby transforming the problem into errors-only decoding of an $(n - \nu, k)$ shortened RS code. This approach is more suitable for direct implementation. The second approach is similar to syndrome-based errors-and-erasures decoding described above, which uses the erasure locator polynomial [5]. In the second approach, only the partial GCD step is affected, while the same fast implementation techniques described in Section IV can be used in the other steps. Thus, the second approach is more suitable for fast implementation.
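The bookkeeping behind the first approach is simple enough to sketch. The Python functions below, whose names are hypothetical, compute the number of correctable errors for a given number of erasures and the parameters of the shortened code obtained by dropping the erased coordinates. They illustrate only the parameter arithmetic stated above, not a decoder.

```python
def max_correctable_errors(t, num_erasures):
    """Errors correctable alongside the given erasures: floor((2t - nu) / 2)."""
    if not 0 <= num_erasures <= 2 * t:
        raise ValueError("number of erasures must satisfy 0 <= nu <= 2t")
    return (2 * t - num_erasures) // 2

def shortened_code(n, k, num_erasures):
    """Parameters (n - nu, k) of the code seen after ignoring erased positions."""
    if not 0 <= num_erasures <= n - k:
        raise ValueError("number of erasures must satisfy 0 <= nu <= n - k")
    return n - num_erasures, k
```

For the (255, 223) code, where $t = 16$, five erasures leave room for 13 errors, and the first approach then decodes a (250, 223) shortened code.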

We readily extend our complexity analysis for errors-only decoding in Sections III and IV to errors-and-erasures decoding. Our conclusions for errors-and-erasures decoding are the same as those for errors-only decoding: Algorithm 1 is the most efficient only for very low rate codes; syndrome-based decoding is the most efficient algorithm for high rate codes. For brevity, we omit the details; interested readers will have no difficulty filling them in.

VI. Conclusion 

We analyze the computational complexities of two syndromeless decoding algorithms for RS codes using both direct implementation and fast implementation, and compare them with their counterparts of syndrome-based decoding. With either direct or fast implementation, syndromeless algorithms are more efficient than the syndrome-based algorithms only for RS codes with very low rate. When implemented in hardware, syndrome-based decoders also have lower complexity and higher throughput. Since RS codes in practice are usually high-rate codes, syndromeless decoding algorithms are not suitable for these codes. Our case study also shows that fast implementations can significantly reduce the decoding complexity. Errors-and-erasures decoding is also investigated, although the details are omitted for brevity.